Abstract

We address a fundamental problem arising from analysis of biomolecular sequences. The
input consists of two numbers $w_{\min}$ and $w_{\max}$ and a sequence $S$ of $n$ number pairs $(a_i, w_i)$ with
$w_i > 0$. Let segment $S(i,j)$ of $S$ be the consecutive subsequence of $S$ between indices $i$ and $j$.
The density of $S(i,j)$ is $d(i,j) = (a_i + a_{i+1} + \cdots + a_j)/(w_i + w_{i+1} + \cdots + w_j)$. The maximum-density
segment problem is to find a maximum-density segment over all segments $S(i,j)$ with
$w_{\min} \le w_i + w_{i+1} + \cdots + w_j \le w_{\max}$. The best previously known algorithm for the problem, due to
Goldwasser, Kao, and Lu, runs in $O(n \log(w_{\max} - w_{\min} + 1))$ time. In the present paper, we solve
the problem in $O(n)$ time. Our approach bypasses the complicated right-skew decomposition
introduced by Lin, Jiang, and Chao. As a result, our algorithm can process the
input sequence in an online manner, which is an important feature for dealing with genome-scale
sequences. Moreover, for a type of input sequences $S$ representable in $O(m)$ space, we
show how to exploit the sparsity of $S$ and solve the maximum-density segment problem for $S$
in $O(m)$ time.

1 Introduction 

We address the following fundamental problem: The input consists of two numbers $w_{\min}$ and
$w_{\max}$ and a sequence $S$ of number pairs $(a_i, w_i)$ with $w_i > 0$ for $i = 1, \ldots, n$. A segment $S(i,j)$ is a
consecutive subsequence of $S$ starting with index $i$ and ending with index $j$. For a segment $S(i,j)$,
the width is $w(i,j) = w_i + w_{i+1} + \cdots + w_j$, and the density is $d(i,j) = (a_i + a_{i+1} + \cdots + a_j)/w(i,j)$.
It is not difficult to see that with an $O(n)$-time preprocessing to compute all $O(n)$ prefix sums
$a_1 + a_2 + \cdots + a_j$ and $w_1 + w_2 + \cdots + w_j$, the density of any segment can be computed in $O(1)$
time. $S(i,j)$ is feasible if $w_{\min} \le w(i,j) \le w_{\max}$. The maximum-density segment problem is to find a
maximum-density segment over all $O(n^2)$ feasible segments.
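
The definitions above can be made concrete with a small sketch. The following Python code is an illustration only, not part of the paper's algorithm: it assumes integer inputs, uses 1-based inclusive indices as in the text, and the helper names `prefix_sums`, `make_density`, and `brute_force` are hypothetical. It shows the $O(1)$ density query via prefix sums and a quadratic-time reference solver that later sketches can be checked against.

```python
from fractions import Fraction
from itertools import accumulate


def prefix_sums(values):
    """Return [0, v1, v1+v2, ...] so that v_i + ... + v_j == P[j] - P[i-1]."""
    return [0] + list(accumulate(values))


def make_density(a, w):
    """Build O(1) width and density queries (1-based, inclusive indices)."""
    A, W = prefix_sums(a), prefix_sums(w)

    def width(i, j):
        return W[j] - W[i - 1]

    def density(i, j):
        # Exact rational arithmetic avoids floating-point ties (assumes integers).
        return Fraction(A[j] - A[i - 1], W[j] - W[i - 1])

    return width, density


def brute_force(a, w, w_min, w_max):
    """Reference solver: scan all O(n^2) segments, keep the first best feasible one."""
    n = len(a)
    width, density = make_density(a, w)
    best = None
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            if w_min <= width(i, j) <= w_max:
                cand = (density(i, j), i, j)
                if best is None or cand[0] > best[0]:
                    best = cand
    return best
```

For the 0/1 sequence `[1, 0, 1, 1, 0]` with unit widths, `brute_force(..., 2, 3)` reports the segment $S(3,4)$ of density 1.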

This problem arises from the investigation of non-uniformity of nucleotide composition within 
genomic sequences, which was first revealed through thermal melting and gradient centrifugation 
experiments [19,26]. The GC content of the DNA sequences in all organisms varies from 25% to
75%. GC-ratios have the greatest variations among bacteria's DNA sequences, while the typical 
GC-ratios of mammalian genomes stay in 45-50%. Despite intensive research effort in the past 
two decades, the underlying causes of the observed heterogeneity remain debatable [2, 3, 5, 8- 
11,16,38,40]. Researchers [30,37] observed that the compositional heterogeneity is highly cor- 
related to the GC content of the genomic sequences. Other investigations showed that gene 
length [7], gene density [42], patterns of codon usage [35], distribution of different classes of repet- 
itive elements [7, 36], number of isochores [2], lengths of isochores [30], and recombination rate 
within chromosomes [12] are all correlated with GC content. More research related to GC-rich 
segments can be found in [14, 15, 18, 21, 27, 29, 33, 39, 41] and the references therein. 

In the most basic form of the maximum-density segment problem, the sequence $S$ corresponds
to the given DNA sequence, where $a_i = 1$ if the corresponding nucleotide in the DNA sequence is
G or C; and $a_i = 0$ otherwise. In the work of Huang [17], sequence entries took on values of $p$ and
$1 - p$ for some real number $0 < p < 1$. More generally, we can look for regions where a given set of
patterns occur very often. In such applications, $a_i$ could be the relative frequency with which the
corresponding DNA character appears in the given patterns. Further natural applications of this
problem can be designed for sophisticated sequence analysis such as mismatch density [34], ungapped
local alignments [1], annotated multiple sequence alignments [37], promoter mapping [20], and
promoter recognition [31].

For the uniform case, i.e., $w_i = 1$ for all indices $i$, Nekrutenko and Li [30], and Rice, Longden,
and Bleasby [32] employed algorithms for the case $w_{\min} = w_{\max}$, which is trivially solvable in
$O(n)$ time. More generally, when $w_{\min} \ne w_{\max}$, the problem is also easily solvable in
$O(n(w_{\max} - w_{\min} + 1))$ time, linear in the number of feasible segments. Huang [17] studied the case where
$w_{\max} = n$, i.e., there is effectively no upper bound on the width of the desired maximum-density
segments. He observed that an optimal segment exists with width at most $2w_{\min} - 1$. Therefore,
this case is equivalent to the case with $w_{\max} = 2w_{\min} - 1$ and can be solved in $O(n w_{\min})$ time in a
straightforward manner. Lin, Jiang, and Chao [25] gave an $O(n \log w_{\min})$-time algorithm for this
case based on right-skew decompositions of a sequence. (See [24] for related software.) The case
with general $w_{\max}$ was first investigated by Goldwasser, Kao, and Lu [13], who gave an $O(n)$-time
algorithm for the uniform case. Recently, Kim [23] showed an alternative algorithm based
upon a geometric interpretation of the problem, which basically relates the maximum-density
segment problem to the fundamental slope selection problem in computational geometry [4, 6, 22, 28].
Unfortunately, Kim's analysis of time complexity has a flaw that seems difficult to fix.$^1$ For
the general (i.e., non-uniform) case, Goldwasser, Kao, and Lu [13] also gave an
$O(n \log(w_{\max} - w_{\min} + 1))$-time algorithm. By bypassing the complicated preprocessing step required in [13],
we reduce the required time for the general case to $O(n)$. Our result is based
upon the following set of equations, stating that the order of $d(x,y)$, $d(y+1,z)$, and $d(x,z)$ with
$x \le y < z$ can be determined by that of any two of them:

$$d(x,y) \le d(y+1,z) \iff d(x,y) \le d(x,z) \iff d(x,z) \le d(y+1,z);$$
$$d(x,y) \ge d(y+1,z) \iff d(x,y) \ge d(x,z) \iff d(x,z) \ge d(y+1,z). \qquad (1)$$

$^1$ Kim claims that all the progressive updates of the lower convex hulls $L_j \cup R_j$ can be done in overall linear time.
The paper only sketches how to obtain $L_{j+1} \cup R_{j+1}$ from $L_j \cup R_j$. (See the fourth-to-last paragraph of page 340 in [23].)
Unfortunately, Kim seems to overlook the marginal cases when the upper bound $w_{\max}$ forces the $p_z$ of $L_j \cup R_j$ to be
deleted from $L_{j+1} \cup R_{j+1}$. As a result, obtaining $L_{j+1} \cup R_{j+1}$ from $L_j \cup R_j$ could be much more complicated than
Kim's sketch. A naive implementation of Kim's algorithm still takes $\Omega(n(w_{\max} - w_{\min} + 1))$ time in the worst case. We
believe that any correct implementation of Kim's algorithm requires $\Omega(n \log(w_{\max} - w_{\min} + 1))$ time in the worst case.
Figure 1: An illustration for Equation (1): There are only two possibilities for the order among
$d(x,y)$, $d(x,z)$, $d(y+1,z)$.



(Both equations can be easily verified by observing the existence of some number $p$ with $0 < p < 1$
and $d(x,z) = p \cdot d(x,y) + (1-p) \cdot d(y+1,z)$. See Figure 1.) Our algorithm is capable of processing the
input sequence in an online manner, which is an important feature for dealing with genome-scale
sequences.
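
The convex-combination argument behind Equation (1) can be checked on a small instance. The sketch below is illustrative only: the numbers are made up, and the helper names `density` and `mixing_weight` are hypothetical. It uses the natural choice $p = w(x,y)/w(x,z)$, which lies strictly between 0 and 1 whenever all widths are positive.

```python
from fractions import Fraction


def density(a, w, x, y):
    """Density of S(x, y) with 1-based inclusive indices."""
    return Fraction(sum(a[x - 1:y]), sum(w[x - 1:y]))


def mixing_weight(w, x, y, z):
    """p = w(x, y) / w(x, z); strictly between 0 and 1 when every w_i > 0."""
    return Fraction(sum(w[x - 1:y]), sum(w[x - 1:z]))


# A made-up instance with x <= y < z.
a = [3, 1, 4, 1, 5]
w = [2, 1, 3, 1, 2]
x, y, z = 1, 2, 5

p = mixing_weight(w, x, y, z)
left = density(a, w, x, y)        # d(x, y)   = 4/3
whole = density(a, w, x, z)       # d(x, z)   = 14/9
right = density(a, w, y + 1, z)   # d(y+1, z) = 5/3

# d(x, z) is a weighted average of the two parts, so it lies between them.
print(p, left, whole, right)
```

Since $d(x,z)$ is a strict weighted average of $d(x,y)$ and $d(y+1,z)$, knowing the order of any two of the three densities fixes the position of the third.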

For bioinformatics applications, e.g., in [1, 20, 31, 34, 37], the input sequence $S$ is usually very
sparse. That is, $S$ can be represented by $m = o(n)$ triples $(a'_1, w'_1, n_1), (a'_2, w'_2, n_2), \ldots, (a'_m, w'_m, n_m)$
with $0 = n_0 < n_1 < n_2 < \cdots < n_m = n$ to signify that $(a_i, w_i) = (a'_j, w'_j)$ holds for all indices
$i$ and $j$ with $n_{j-1} < i \le n_j$ and $1 \le j \le m$. If $w'_j = 1$ holds for all $1 \le j \le m$, we show how
to exploit the sparsity of $S$ and solve the maximum-density problem for $S$ given in the above
compact representation in $O(m)$ time.

The remainder of the paper is organized as follows. Section 2 shows the main algorithm.
Section 3 explains how to cope with the simple case that the width upper bound $w_{\max}$ is ineffective.
Section 4 takes care of the more complicated case that $w_{\max}$ is effective. Section 5 explains how to
exploit the sparsity of the input sequence for the uniform case.

2 The main algorithm 

For any integers $x$ and $y$, let $[x, y]$ denote the set $\{x, x+1, \ldots, y\}$. Throughout the paper, we need
the following definitions and notation with respect to the input length-$n$ sequence $S$ and width
bounds $w_{\min}$ and $w_{\max}$. Let $j_0$ be the smallest index with $w(1, j_0) \ge w_{\min}$. Let $J = [j_0, n]$. For
each $j \in J$, let $\ell_j$ (respectively, $r_j$) be the smallest (respectively, largest) index $i$ with
$w_{\min} \le w(i,j) \le w_{\max}$. That is, $S(i,j)$ is feasible if and only if $i \in [\ell_j, r_j]$. (Figure 3 is an illustration for
the definitions of $\ell_j$ and $r_j$.) Clearly, for the uniform case, we have $\ell_{i+1} = \ell_i + 1$ and $r_{i+1} = r_i + 1$.
As for the general case, we only know that $\ell_j$ and $r_j$ are both nondecreasing (not necessarily strictly
increasing) in $j$. One can easily compute all $\ell_j$ and $r_j$ in $O(n)$ time. Let $i^*_j$ be the largest index $k \in [\ell_j, r_j]$
with $d(k,j) = \max\{d(i,j) \mid i \in [\ell_j, r_j]\}$. Clearly, there must be an index $j^*$ such that $S(i^*_{j^*}, j^*)$
is a maximum-density segment of $S$. Therefore, a natural but seemingly difficult possibility to
solve the maximum-density segment problem would be to compute $i^*_j$ for all indices $j \in J$ in
$O(n)$ time. Instead, our strategy is to compute an index $i_j \in [\ell_j, r_j]$ for each index $j \in J$ by
the algorithm shown in Figure 2, where $\phi(x,y)$ is defined to be the largest index $z \in [x,y]$ that
minimizes $d(x,z)$. That is, $S(x, \phi(x,y))$ is the longest minimum-density prefix of $S(x,y)$. The rest
of the section ensures the correctness of our algorithm by showing $i_{j^*} = i^*_{j^*}$ and thus reduces the
algorithm MAIN

1 let $i_{j_0-1} = 1$;
2 for $j = j_0$ to $n$ do {
3     let $i_j$ = BEST($\max(i_{j-1}, \ell_j)$, $r_j$, $j$);
4     output $(i_j, j)$;
5 }

function BEST($\ell$, $r$, $j$)

1 let $i = \ell$;
2 while $i < r$ and $d(i, \phi(i, r-1)) \le d(i, j)$ do
3     let $i = \phi(i, r-1) + 1$;
4 return $i$;



Figure 2: Our main algorithm.



Figure 3: An illustration for the definitions of $\ell_j$ and $r_j$.



maximum-density segment problem to implementing our algorithm to run in $O(n)$ time.
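
The claim that all $\ell_j$ and $r_j$ can be computed in $O(n)$ time follows from a two-pointer sweep, since both bounds only move to the right as $j$ grows. The sketch below is one possible realization, not code from the paper; the function name `window_bounds` is hypothetical, and it assumes that some prefix reaches width $w_{\min}$. When no feasible segment ends at $j$ (possible only for non-uniform widths), the sketch returns $\ell_j > r_j$.

```python
from itertools import accumulate


def window_bounds(w, w_min, w_max):
    """Return (j0, ell, r) where ell[j] and r[j] bound the feasible start indices.

    Indices are 1-based. ell[j] is the smallest i with w(i, j) <= w_max and
    r[j] is the largest i with w(i, j) >= w_min; segment S(i, j) is feasible
    exactly when ell[j] <= i <= r[j].
    """
    n = len(w)
    W = [0] + list(accumulate(w))

    def width(i, j):
        return W[j] - W[i - 1]

    # j0: smallest j whose prefix S(1, j) is wide enough (assumed to exist).
    j0 = next(j for j in range(1, n + 1) if W[j] >= w_min)

    ell, r = {}, {}
    lo, hi = 1, 1
    for j in range(j0, n + 1):
        while width(lo, j) > w_max:                    # too wide: move left bound right
            lo += 1
        while hi < j and width(hi + 1, j) >= w_min:    # still wide enough: move right bound
            hi += 1
        ell[j], r[j] = lo, hi
    return j0, ell, r
```

Each pointer advances at most $n$ times in total, which gives the linear bound.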

Lemma 1 The index returned by function call BEST($\ell$, $r$, $j$) is the largest index $i \in [\ell, r]$ that maximizes
$d(i,j)$.

Proof. Let $i^*$ be the largest index in $[\ell, r]$ that maximizes $d(i,j)$, i.e., $d(i^*, j) = \max_{i \in [\ell, r]} d(i,j)$. Let
$i_j$ be the index returned by function call BEST($\ell$, $r$, $j$). We show $i_j = i^*$ as follows. If $i_j < i^*$, then
$i_j < r$. By the condition of the while-loop at Step 2 of BEST, we know $d(i_j, \phi(i_j, r-1)) > d(i_j, j)$.
By $d(i_j, j) \le d(i^*, j)$ and Equation (1), we have $d(i_j, i^*-1) \le d(i_j, j)$. It follows that
$d(i_j, i^*-1) < d(i_j, \phi(i_j, r-1))$, contradicting the definition of $\phi(i_j, r-1)$.

On the other hand, suppose that $i_j > i^*$. By definition of BEST, there must be an index $i \in [\ell, r]$
with $i < r$, $d(i, \phi(i, r-1)) \le d(i,j)$, and $i \le i^* < \phi(i, r-1) + 1$. If $i = i^*$, by Equation (1) we
have $d(i^*, \phi(i^*, r-1)) \le d(i^*, j) \le d(\phi(i^*, r-1)+1, j)$, where the last inequality contradicts the
definition of $i^*$. Now that $i < i^*$, we have
$d(i^*, j) \ge d(i,j) \ge d(i, i^*-1) \ge d(i, \phi(i, r-1)) \ge d(i^*, \phi(i, r-1))$,
where (a) the first inequality is by definition of $i^*$, (b) the second inequality is by
Equation (1) and the first inequality, (c) the third inequality is by $i^* \le \phi(i, r-1)$ and definition of
$\phi(i, r-1)$, and (d) the last inequality is by Equation (1) and the third inequality. It follows from
$d(i^*, j) \ge d(i^*, \phi(i, r-1))$ and Equation (1) that $d(\phi(i, r-1)+1, j) \ge d(i^*, j)$, contradicting the
definition of $i^*$ by $i^* < \phi(i, r-1) + 1$. □

Theorem 1 Algorithm MAIN correctly solves the maximum-density problem. 
Figure 4: An illustration for Condition $C_j$.



Proof. We prove the theorem by showing $i_{j^*} = i^*_{j^*}$. Clearly, by $\ell_{j_0} = i_{j_0-1} = 1$ and Lemma 1,
the equality holds if $j^* = j_0$. The rest of the proof assumes $j^* > j_0$. By Lemma 1 and $\ell_{j^*} \le i^*_{j^*}$, it
suffices to ensure $i_{j^*-1} \le i^*_{j^*}$. Assume for contradiction that there is an index $j \in [j_0, j^*-1]$ with
$i_{j-1} \le i^*_{j^*} < i_j$. By $j < j^*$, we know $\ell_j \le i^*_{j^*}$. By Lemma 1 and $\max(\ell_j, i_{j-1}) \le i^*_{j^*} < i_j \le r_j$, we
have $d(i_j, j) \ge d(i^*_{j^*}, j)$. It follows from Equation (1) and $i^*_{j^*} < i_j$ that $d(i^*_{j^*}, j) \ge d(i^*_{j^*}, i_j - 1)$. By
$\ell_{j^*} \le i^*_{j^*} < i_j \le r_{j^*}$ and definition of $i^*_{j^*}$, we know $d(i^*_{j^*}, j^*) > d(i_j, j^*)$. It follows from $i^*_{j^*} < i_j$ and
Equation (1) that $d(i^*_{j^*}, i_j - 1) > d(i^*_{j^*}, j^*)$. Therefore, $d(i_j, j) \ge d(i^*_{j^*}, j) \ge d(i^*_{j^*}, i_j - 1) > d(i^*_{j^*}, j^*)$,
contradicting the definition of $j^*$. □

One can verify that the value of $i$ increases by at least one each time Step 3 of BEST is executed.
Therefore, to implement the algorithm to run in $O(n)$ time, it suffices to maintain a data structure
to support $O(1)$-time query for each $\phi(i, r_j - 1)$ in Step 2 of BEST.
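
Before the $O(1)$-time query structure is introduced, the logic of MAIN and BEST can be exercised with a naive $\phi$. The following sketch is an illustration under stated assumptions, not the linear-time implementation: `phi` is a linear scan, inputs are integers, every $j \ge j_0$ is assumed to admit a feasible segment (always true for unit widths with $w_{\min} \le w_{\max}$), and the function name `solve_main` is hypothetical.

```python
from fractions import Fraction
from itertools import accumulate


def solve_main(a, w, w_min, w_max):
    """Return (density, i, j) of a maximum-density feasible segment (1-based)."""
    n = len(a)
    A = [0] + list(accumulate(a))
    W = [0] + list(accumulate(w))

    def width(i, j):
        return W[j] - W[i - 1]

    def d(i, j):
        return Fraction(A[j] - A[i - 1], width(i, j))

    def phi(x, y):
        # Largest z in [x, y] minimizing d(x, z); naive scan, NOT O(1).
        return max(range(x, y + 1), key=lambda z: (-d(x, z), z))

    def best(ell, r, j):
        # Function BEST: repeatedly drop the longest minimum-density prefix
        # while doing so does not decrease the density of the segment ending at j.
        i = ell
        while i < r and d(i, phi(i, r - 1)) <= d(i, j):
            i = phi(i, r - 1) + 1
        return i

    j0 = next(j for j in range(1, n + 1) if W[j] >= w_min)
    lo = hi = prev = 1            # ell_j, r_j, and i_{j-1}
    answer = None
    for j in range(j0, n + 1):
        while width(lo, j) > w_max:
            lo += 1
        while hi < j and width(hi + 1, j) >= w_min:
            hi += 1
        prev = best(max(prev, lo), hi, j)      # Step 3 of MAIN
        if answer is None or d(prev, j) > answer[0]:
            answer = (d(prev, j), prev, j)
    return answer
```

On `[1, 0, 1, 1, 0]` with unit widths and bounds 2 and 3, the sketch returns the segment $S(3,4)$ of density 1. Replacing the scan in `phi` by constant-time lookups is exactly what the next two sections address.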

3 Coping with ineffective width upper bound 

When $w_{\max}$ is ineffective, i.e., $w_{\max} \ge w(1,n)$, we have $\ell_j = 1$ for all $j \in J$. Therefore, the function
call in Step 3 of MAIN is exactly BEST($i_{j-1}$, $r_j$, $j$). Moreover, during the execution of the function
call BEST($i_{j-1}$, $r_j$, $j$), the value of $i$ can only be $i_{j-1}$, $\phi(i_{j-1}, r_j - 1) + 1$,
$\phi(\phi(i_{j-1}, r_j - 1) + 1, r_j - 1) + 1$, ..., $r_j$. Suppose that a subroutine call to UPDATE($j$) yields an array $\Phi$ of indices and two
indices $p$ and $q$ of $\Phi$ with $p \le q$ and $\Phi[p] = i_{j-1}$ such that the following condition holds.

Condition $C_j$: $\Phi[q] = r_j$ and $\Phi[t] = \phi(\Phi[t-1], r_j - 1) + 1$ holds for each index $t \in [p+1, q]$.

(See Figure 4 for an illustration.) Then, the subroutine call to BEST($i_{j-1}$, $r_j$, $j$) can clearly be
replaced by LBEST($j$), as defined in Figure 5. That is, LBEST($j$) can access the value of each $\phi(i, r_j - 1)$
by looking up $\Phi$ in $O(1)$ time. It remains to show how to implement UPDATE($j$) such that all $O(n)$
subroutine calls to UPDATE from Step 3 of LMAIN run in overall $O(n)$ time. The following lemma is
crucial in ensuring the correctness and efficiency of our implementation shown in Figure 5, where
Condition $C_{j_0-1}$ stands for $p = 1$, $q = 0$, and $i_{j_0-1} = 1$.

Lemma 2 For each index $j \in J$, the following statements hold.

1. If Condition $C_{j-1}$ holds right before calling UPDATE($j$), then Condition $C_j$ holds right after the
subroutine call.

2. If Condition $C_j$ holds right before calling LBEST($j$), then the index returned by the function call is
exactly that returned by BEST($\Phi[p]$, $\Phi[q]$, $j$).
algorithm LMAIN

1 let $p = 1$, $q = 0$, and $i_{j_0-1} = 1$;
2 for $j = j_0$ to $n$ do {
3     call UPDATE($j$);
4     let $i_j$ = LBEST($j$);
5     output $(i_j, j)$;
6 }

function LBEST($j$)

1 while $p < q$ and $d(\Phi[p], \Phi[p+1] - 1) \le d(\Phi[p], j)$ do
2     let $p = p + 1$;
3 return $\Phi[p]$;

subroutine UPDATE($j$)

1 for $r = r_{j-1} + 1$ to $r_j$ do {
2     while $p < q$ and $d(\Phi[q-1], \Phi[q] - 1) \ge d(\Phi[q-1], r - 1)$ do
3         let $q = q - 1$;
4     let $q = q + 1$;
5     let $\Phi[q] = r$;
6 }



Figure 5: An efficient implementation for the case that $w_{\max}$ is ineffective.

Proof. Statement 1. It is not difficult to verify that with the initialization $p = 1$, $q = 0$, and
$i_{j_0-1} = 1$, Condition $C_{j_0}$ holds with $p = 1$ and $q \ge 1$ after calling UPDATE($j_0$). The rest of the
proof assumes $j \in J - \{j_0\}$. For each $r \in [r_{j-1} + 1, r_j]$, one can see from the definition of $\phi$ that
$\phi(\ell, r-1)$ is either $\phi(\ell, r-2)$ or $r-1$. More precisely, we have $\phi(\ell, r-1) = r-1$ if and only if
$d(\ell, \phi(\ell, r-2)) \ge d(\ell, r-1)$. Furthermore, if $\phi(\ell, r-2) < r-2$, then one can prove as follows that
$\phi(\phi(\ell, r-2) + 1, r-1) = \phi(\phi(\ell, r-2) + 1, r-2)$ implies $\phi(\ell, r-1) = \phi(\ell, r-2)$.

Let $m = \phi(\ell, r-2)$. By $\phi(m+1, r-1) = \phi(m+1, r-2)$, we have $d(m+1, \phi(m+1, r-1)) < d(m+1, r-1)$.
By definition of $\phi$ and Equation (1), we have
$d(\ell, m) < d(\ell, \phi(m+1, r-1)) < d(m+1, \phi(m+1, r-1))$. As a result, we have $d(\ell, m) < d(m+1, r-1)$, which
by Equation (1) implies $d(\ell, m) < d(\ell, r-1)$. Thus $\phi(\ell, r-1) = \phi(\ell, r-2)$.

Therefore, at the end of each iteration of the for-loop of UPDATE($j$), we have that $\Phi[q] = r$ and
$\Phi[t] = \phi(\Phi[t-1], r-1) + 1$ holds for each index $t \in [p+1, q]$. (The value of $q$ may change, though.)
It follows that at the end of the for-loop, Condition $C_j$ holds.

Statement 2. By Condition $C_j$, one can easily verify that LBEST($j$) is a faithful implementation
of BEST($\Phi[p]$, $\Phi[q]$, $j$). Therefore, the statement holds. □

Lemma 3 The implementation LMAIN solves the maximum-density problem for the case with ineffective
$w_{\max}$ in $O(n)$ time.

Proof. By Lemma 2(1) and definitions of UPDATE and LBEST, both Condition $C_j$ and $\Phi[p] = i_{j-1}$
hold right after the subroutine call to UPDATE($j$). By Condition $C_j$ and Lemma 2(2), LBEST($j$) is
a faithful implementation of BEST($\Phi[p]$, $\Phi[q]$, $j$). Therefore, the correctness of LMAIN follows from
$\Phi[p] = i_{j-1}$, $\Phi[q] = r_j$, and Theorem 1.

As for the efficiency of LMAIN, observe that $q - p \ge -1$ holds throughout the execution of
LMAIN. Note that each iteration of the while-loops of LBEST and UPDATE decreases the value of
$q - p$ by one. Clearly Step 4 of UPDATE is the only place that increases the value of $q - p$. Since it
increases the value of $q - p$ by one for $O(n)$ times, the overall running time of LMAIN is $O(n)$. □
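
The interplay of UPDATE and LBEST is easiest to see in running code. The sketch below is an illustrative transcription of Figure 5 under a few assumptions: integer inputs, no width upper bound, a Python list standing in for the array $\Phi$ (its live window $\Phi[p..q]$ is the slice from position `p` to the end), and a hypothetical function name `solve_unbounded`.

```python
from fractions import Fraction
from itertools import accumulate


def solve_unbounded(a, w, w_min):
    """Maximum-density segment with width >= w_min and no upper bound (1-based)."""
    n = len(a)
    A = [0] + list(accumulate(a))
    W = [0] + list(accumulate(w))

    def d(i, j):
        return Fraction(A[j] - A[i - 1], W[j] - W[i - 1])

    j0 = next(j for j in range(1, n + 1) if W[j] >= w_min)
    Phi, p = [], 0            # live window Phi[p..q] is Phi[p:]
    r_prev, r_j = 0, 1        # r_{j-1} and r_j
    answer = None
    for j in range(j0, n + 1):
        # r_j: largest i with w(i, j) >= w_min.
        while r_j < j and W[j] - W[r_j] >= w_min:
            r_j += 1
        # UPDATE(j): append each new r, first popping entries that r - 1 absorbs.
        for r in range(r_prev + 1, r_j + 1):
            while len(Phi) - p >= 2 and d(Phi[-2], Phi[-1] - 1) >= d(Phi[-2], r - 1):
                Phi.pop()
            Phi.append(r)
        r_prev = r_j
        # LBEST(j): advance p while dropping a minimum-density prefix does not hurt.
        while len(Phi) - p >= 2 and d(Phi[p], Phi[p + 1] - 1) <= d(Phi[p], j):
            p += 1
        i_j = Phi[p]
        if answer is None or d(i_j, j) > answer[0]:
            answer = (d(i_j, j), i_j, j)
    return answer
```

Every index is appended once and removed at most once from either end, which mirrors the potential argument on $q - p$ in the proof of Lemma 3.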

4 Coping with effective width upper bound 

In contrast to the previous simple case, when $w_{\max}$ is arbitrary, $\ell_j$ may not always be 1.
Therefore, the first argument of the function call in Step 3 of MAIN could be $\ell_j$ with $\ell_j > i_{j-1}$. It seems
quite difficult to update the corresponding data structure $\Phi$ in overall linear time such that both
$\Phi[p] = \max(i_{j-1}, \ell_j)$ and Condition $C_j$ hold throughout the execution of our algorithm. To
overcome the difficulty, our algorithm sticks with Condition $C_j$ but allows $\Phi[p] > \max(i_{j-1}, \ell_j)$. As a
result, $\max_{j \in J} d(i_j, j)$ may be less than $\max_{j \in J} d(i^*_j, j)$. Fortunately, this potential problem can be
resolved if we simultaneously solve a series of variant versions of the maximum-density segment
problem.

4.1 A variant version of the maximum-density segment problem 

Suppose that we are given two indices $r$ and $y_0$ with $w(r, y_0) \ge w_{\min}$. Let $X = [\ell, r]$ and $Y = [y_0, y_1]$
be two intervals such that $\ell = \ell_{y_0}$ and $y_1$ is the largest index in $J$ with $w(r, y_1) \le w_{\max}$. See
Figure 6 for an illustration. The variant version of the maximum-density segment problem is to
look for a maximum-density segment over all feasible segments $S(x,y)$ with $x \in X$, $y \in Y$, and
$w_{\min} \le w(x,y) \le w_{\max}$ such that $d(x,y)$ is maximized.

For each $y \in Y$, let $x^*_y$ be the largest index $x \in X$ with $w_{\min} \le w(x,y) \le w_{\max}$ that maximizes
$d(x,y)$. Let $y^*$ be an index in $Y$ with $d(x^*_{y^*}, y^*) = \max_{y \in Y} d(x^*_y, y)$. Although solving the variant
version can naturally be reduced to computing the index $x^*_y$ for each index $y \in Y$, the required
running time is more than what we can afford. Instead, we compute an index $x_y \in X$ with
$w_{\min} \le w(x_y, y) \le w_{\max}$ for each index $y \in Y$ such that $x_{y^*} = x^*_{y^*}$. By $w(r, y_0) \ge w_{\min}$ and
$w(r, y_1) \le w_{\max}$, one can easily see that, for each $y \in Y$, $r$ is always the largest index $x \in X$ with
$w_{\min} \le w(x,y) \le w_{\max}$. Our algorithm for solving the variant problem is as shown in Figure 7,
presented in a way to emphasize the analogy between VMAIN and MAIN. For example, the index
$x_y$ in VMAIN is the counterpart of the index $i_j$ in MAIN. Also, the index $r$ in VMAIN plays the
algorithm VMAIN($r$, $y_0$)

1 let $\ell$ be the smallest index in $[1, n]$ with $w(\ell, y_0) \le w_{\max}$;
2 let $y_1$ be the largest index in $[1, n]$ with $w(r, y_1) \le w_{\max}$;
3 let $x_{y_0-1} = \ell$;
4 for $y = y_0$ to $y_1$ do {
5     let $x_y$ = BEST($\max(x_{y-1}, \ell_y)$, $r$, $y$);
6     output $(x_y, y)$;
7 }



Figure 7: Our algorithm for the variant version of the maximum-density segment problem, where
function BEST is as defined in Figure 2.

role of the index $r_j$ in MAIN. We have the following lemma whose proof is very similar to that of
Theorem 1.

Lemma 4 Algorithm VMAIN correctly solves the variant version of the maximum-density problem.

Proof. We prove the lemma by showing $x_{y^*} = x^*_{y^*}$. Clearly, by $\ell_{y_0} = x_{y_0-1} = \ell$ and Lemma 1,
the equality holds if $y^* = y_0$. The rest of the proof assumes $y^* > y_0$. By Lemma 1 and $\ell_{y^*} \le x^*_{y^*}$, it
suffices to ensure $x_{y^*-1} \le x^*_{y^*}$. Assume for contradiction that there is an index $y \in [y_0, y^*-1]$ with
$x_{y-1} \le x^*_{y^*} < x_y$. By $y < y^*$, we know $\ell_y \le x^*_{y^*}$. By Lemma 1 and $\max(\ell_y, x_{y-1}) \le x^*_{y^*} < x_y \le r$, we
have $d(x_y, y) \ge d(x^*_{y^*}, y)$. It follows from Equation (1) and $x^*_{y^*} < x_y$ that $d(x^*_{y^*}, y) \ge d(x^*_{y^*}, x_y - 1)$.
By $\ell_{y^*} \le x^*_{y^*} < x_y \le r$ and definition of $x^*_{y^*}$, we know $d(x^*_{y^*}, y^*) > d(x_y, y^*)$. It follows from
$x^*_{y^*} < x_y$ and Equation (1) that $d(x^*_{y^*}, x_y - 1) > d(x^*_{y^*}, y^*)$. Therefore,
$d(x_y, y) \ge d(x^*_{y^*}, y) \ge d(x^*_{y^*}, x_y - 1) > d(x^*_{y^*}, y^*)$, contradicting the definition of $y^*$. □

Again, the challenge lies in supporting each query to $\phi(i, r-1)$ of BEST in $O(1)$ time during the
execution of VMAIN. Fortunately, unlike during the execution of MAIN where both parameters of
$\phi(i, r-1)$ may change, the second parameter $r-1$ is now fixed. Therefore, to support each query to
$\phi(i, r-1)$ in $O(1)$ time, we can actually afford $O(r - \ell + 1)$ time to compute a data structure $\Psi$ such
that $\Psi[i] = \phi(i, r-1)$ for each $i \in [\ell, r-1]$. As a result, the function BEST can be implemented as
the function VBEST shown in Figure 8. The following lemma ensures the correctness and efficiency
of our implementation VARIANT shown in Figure 8.

Lemma 5 The implementation VARIANT correctly solves the variant version of the maximum-density
segment problem in $O(r - \ell + y_1 - y_0 + 1)$ time.

Proof. Clearly, if $\Psi[i] = \phi(i, r-1)$ holds for each index $i \in [\ell, r-1]$, then VBEST is a faithful
implementation of BEST. By Lemma 4, the correctness of VARIANT can be ensured by showing
that after calling INIT($\ell$, $r$), $\Psi[i] = \phi(i, r)$ holds for each index $i \in [\ell, r]$. By Step 1 of INIT, we have
$\Psi[r] = r = \phi(r, r)$. Now suppose that $\Psi[i] = \phi(i, r)$ holds for each index $i \in [x+1, r]$ right before
INIT is about to execute the iteration for index $x \in [\ell, r-1]$. It suffices to show $\Psi[x] = \phi(x, r)$ after the
iteration. Let $Z_x = \{x, \Psi[x+1], \Psi[\Psi[x+1]+1], \Psi[\Psi[\Psi[x+1]+1]+1], \ldots, r\}$. Let $|Z_x|$ denote the
cardinality of $Z_x$. We first show $\phi(x, r) \in Z_x$ as follows.
algorithm VARIANT($r$, $y_0$)

1 let $\ell$ be the smallest index in $[1, n]$ with $w(\ell, y_0) \le w_{\max}$;
2 let $y_1$ be the largest index in $[1, n]$ with $w(r, y_1) \le w_{\max}$;
3 call INIT($\ell$, $r - 1$);
4 let $x_{y_0-1} = \ell$;
5 for $y = y_0$ to $y_1$ do {
6     let $x_y$ = VBEST($\max(x_{y-1}, \ell_y)$, $r$, $y$);
7     output $(x_y, y)$;
8 }

function VBEST($\ell$, $r$, $y$)

1 let $x = \ell$;
2 while $x < r$ and $d(x, \Psi[x]) \le d(x, y)$ do
3     let $x = \Psi[x] + 1$;
4 return $x$;

subroutine INIT($\ell$, $r$)

1 let $\Psi[r] = r$;
2 for $s = r - 1$ downto $\ell$ do {
3     let $t = s$;
4     while $t < r$ and $d(s, t) \ge d(s, \Psi[t+1])$ do
5         let $t = \Psi[t+1]$;
6     let $\Psi[s] = t$;
7 }



Figure 8: An efficient implementation for the algorithm VMAIN.

Assume for contradiction that $\phi(x, r) \notin Z_x$, i.e., there is an index $z \in Z_x$ with
$z < \phi(x, r) < \Psi[z+1] = \phi(z+1, r)$. By definition of $\phi$ and Equation (1), we have
$d(z+1, \phi(x,r)) \ge d(z+1, \phi(z+1,r)) \ge d(\phi(x,r)+1, \phi(z+1,r))$ and
$d(\phi(x,r)+1, \phi(z+1,r)) > d(x, \phi(z+1,r)) > d(x, \phi(x,r))$. By $d(z+1, \phi(x,r)) > d(x, \phi(x,r))$ and Equation (1), we
have $d(x, \phi(x,r)) > d(x, z)$, contradicting the definition of $\phi(x, r)$.

For any index $z \in Z_x$ with $z < \phi(x, r)$, we know $z < r$ and $\phi(z+1, r) = \Psi[z+1] \le \phi(x, r)$. By
$\phi(z+1, r) \le \phi(x, r) \le r$ and definition of $\phi(z+1, r)$, we have $d(z+1, \phi(x,r)) \ge d(z+1, \phi(z+1,r))$.
By definition of $\phi(x, r)$ and Equation (1), we have $d(x, z) \ge d(x, \phi(x,r)) \ge d(z+1, \phi(x,r))$. By
$d(x, z) \ge d(z+1, \phi(z+1, r))$ and Equation (1), we have $d(x, z) \ge d(x, \phi(z+1, r))$. Therefore, if
$z < \phi(x, r)$, then Step 5 of INIT will be executed to increase the value of $z$ (the variable $t$ in Figure 8).
Observe that $\phi(x, r) = z < r$ and $\Psi[z+1] > z$ imply $d(x, z) < d(x, \Psi[z+1])$. It follows that as soon as $z = \phi(x, r)$ holds,
whether $\phi(x, r) = r$ or not, the value of $\Psi[x]$ will immediately be set to $z$ at Step 6 of INIT.

One can see that the running time is indeed $O(r - \ell + y_1 - y_0 + 1)$ by verifying that throughout
the execution of the implementation, (a) the while-loop of VBEST runs for $O(r - \ell + y_1 - y_0 + 1)$
iterations, and (b) the while-loop of INIT runs for $O(r - \ell + 1)$ iterations. To see
statement (a), just observe that the value of index $x$ (i) never decreases, (ii) stays in $[\ell, r]$, and (iii)
algorithm GENERAL

1 let $p = 1$, $q = 0$, and $i_{j_0-1} = 1$;
2 for $j = j_0$ to $n$ do {
3     call UPDATE($j$);
4     while $\Phi[p] < \ell_j$ do
5         let $p = p + 1$;
6     if $i_{j-1} < \Phi[p]$ then
7         call VARIANT($\Phi[p]$, $j$);
8     let $i_j$ = LBEST($j$);
9     output $(i_j, j)$;
10 }



Figure 9: Our algorithm for the general case, where UPDATE and LBEST are defined in Figure 5 and
VARIANT is defined in Figure 8.




Figure 10: An illustration for the situation when Steps 6 and 7 of GENERAL are needed. 

increases by at least one each time Step 3 of VBEST is executed. As for statement (b), consider
the iteration with index $s$ of the for-loop of INIT. Note that if Step 5 of INIT executes $t_s$ times
in this iteration, then $|Z_s| = |Z_{s+1}| - t_s + 1$. Since $|Z_s| \ge 1$ holds for each $s \in X$, we have
$\sum_{s \in X} t_s = O(r - \ell + 1)$, and thus statement (b) holds. □
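
The subroutine INIT is compact enough to transcribe directly. The sketch below is an illustration, not the paper's code: it assumes integer inputs, stores $\Psi$ in a dictionary, and includes a naive `phi_naive` only so that the table can be compared against the definition of $\phi$. The names `make_d`, `init_psi`, and `phi_naive` are hypothetical.

```python
from fractions import Fraction
from itertools import accumulate


def make_d(a, w):
    """Return an O(1) density function d(i, j) with 1-based inclusive indices."""
    A = [0] + list(accumulate(a))
    W = [0] + list(accumulate(w))
    return lambda i, j: Fraction(A[j] - A[i - 1], W[j] - W[i - 1])


def init_psi(d, ell, r):
    """Subroutine INIT: compute Psi[s] = phi(s, r) for every s in [ell, r]."""
    psi = {r: r}
    for s in range(r - 1, ell - 1, -1):
        t = s
        # Jump along the chain s, Psi[s+1], Psi[Psi[s+1]+1], ... while the
        # longer prefix is at least as sparse (ties favor the larger index).
        while t < r and d(s, t) >= d(s, psi[t + 1]):
            t = psi[t + 1]
        psi[s] = t
    return psi


def phi_naive(d, x, y):
    """Definition of phi: largest z in [x, y] minimizing d(x, z)."""
    return max(range(x, y + 1), key=lambda z: (-d(x, z), z))
```

For the values `[3, 1, 2]` with unit widths, `init_psi(make_d([3, 1, 2], [1, 1, 1]), 1, 3)` yields $\Psi[1] = 3$, $\Psi[2] = 2$, and $\Psi[3] = 3$. Each jump in the while-loop removes one element from the chain that later iterations can visit, which is the counting argument used for statement (b).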

4.2 Our algorithm for the general case 

With the help of VARIANT, we have a linear-time algorithm for solving the original maximum-density
segment problem as shown in Figure 9. Algorithm GENERAL is obtained by inserting four
lines of code (i.e., Steps 4-7 of GENERAL) between Steps 3 and 4 of LMAIN in order to handle the
case $i_{j-1} < \ell_j$. Specifically, when $i_{j-1} < \ell_j$, we cannot afford to appropriately update the data
structure $\Phi$. Therefore, instead of moving $i$ to $\ell_j$, Steps 4 and 5 move $i$ to $\Phi[p]$, where $p$ is the
smallest index with $\ell_j \le \Phi[p]$. Of course, these two steps may cause our algorithm to overlook
the possibility of $i_j \in [i_{j-1}, \Phi[p] - 1]$, as illustrated in Figure 10. This is when the variant version
comes in: As shown in the next theorem, we can remedy the problem by calling VARIANT($\Phi[p]$, $j$).

Theorem 2 The linear-time algorithm GENERAL solves the maximum-density segment problem in an
online manner.

Proof. We prove the correctness of GENERAL by showing that $i^*_{j^*} \ne i_{j^*}$ implies $i^*_{j^*} = x_{j^*}$.
By Lemma 2(1), after the subroutine call UPDATE($j$) at Step 3 of GENERAL, Condition $C_j$ holds.
Clearly, Steps 4 and 5 of GENERAL, which may increase the value of $p$, do not affect the validity of

Figure 11: An illustration for showing that the overall running time of all subroutine calls to
VARIANT($i'_j$, $j$) in GENERAL is $O(n)$.

Condition $C_j$. Clearly, Steps 6 and 7 do not modify $p$, $q$, and $\Phi$. Let $i'_j$ be the value of $\Phi[p]$ right
before executing Step 8 of GENERAL. By Lemma 2(2), the index $i_j$ returned by LBEST($j$) is the largest
index $i \in [i'_j, r_j]$ that maximizes $d(i, j)$. Clearly, $i_{j^*} = i^*_{j^*}$ implies the correctness of GENERAL. If
$i_{j^*} \ne i^*_{j^*}$, then there must be an index $j \in [j_0, j^*]$ such that $i_{j-1} \le i^*_{j^*} < i_j$. It can be proved as
follows that $i^*_{j^*} < i'_j$.

Assume $i'_j \le i^*_{j^*}$ for contradiction. It follows from Lemma 2(2) and Equation (1) that
$d(i_j, j) \ge d(i^*_{j^*}, j) \ge d(i^*_{j^*}, i_j - 1)$. By definition of $i^*_{j^*}$, we have $d(i^*_{j^*}, j^*) > d(i_j, j^*)$,
which by Equation (1) implies $d(i^*_{j^*}, i_j - 1) > d(i^*_{j^*}, j^*)$. Therefore, $d(i_j, j) > d(i^*_{j^*}, j^*)$,
contradicting the definition of $j^*$.

Since $i_{j-1} \le i^*_{j^*} < i'_j$, we know $w(i'_j, j^*) < w(i^*_{j^*}, j^*) \le w_{\max}$. It follows from Lemma 5 that
there is an index pair $(x, y)$ with $w_{\min} \le w(x, y) \le w_{\max}$ and $d(x, y) = d(i^*_{j^*}, j^*)$.

As for the running time, observe that $q - p \ge -1$ holds throughout the execution of GENERAL.
Note that each iteration of the while-loops of GENERAL, LBEST, and UPDATE decreases the value of
$q - p$ by one. Clearly, Step 4 of UPDATE is the only place that increases the value of $q - p$. Moreover,
it increases the value of $q - p$ by one for $O(n)$ times. Therefore, to show that the overall running
time of GENERAL is $O(n)$, it remains to ensure that all those subroutine calls to VARIANT at Step 7
of GENERAL take overall $O(n)$ time. Suppose that $j$ and $k$ are two arbitrary indices with $k < j$ such
that GENERAL makes subroutine calls to VARIANT($i'_k$, $k$) and VARIANT($i'_j$, $j$). Let $r'_k$ be the largest
index in $[1, n]$ with $w(i'_k, r'_k) \le w_{\max}$. By Lemma 5, it suffices to show that $i'_k < \ell_j$ and $r'_k < j$ as
follows. (See Figure 11.) By definition of GENERAL, we know that $i_{j-1} < \ell_j$, which is ensured by
the situation illustrated in Figure 10. By $k < j$, we have $i'_k \le i_{j-1}$, implying $i'_k < \ell_j$. Moreover, by
definitions of $\ell_j$ and $r'_k$, one can easily verify that $i'_k < \ell_j$ implies $r'_k < j$.

It is clear that our algorithm shown in Figure 9 is already capable of processing the input
sequence in an online manner, since the only preprocessing required is to obtain $\ell_j$, $r_j$, and the
prefix sums of $a_1, a_2, \ldots, a_j$ and $w_1, w_2, \ldots, w_j$ (for the purpose of evaluating the density of any
segment in $O(1)$ time), which can easily be computed on the fly. □

5 Exploiting sparsity for the uniform case 

In this section, we assume that $S$ is represented by $m$ pairs $(a'_1, n_1), (a'_2, n_2), \ldots, (a'_m, n_m)$ with
$0 = n_0 < n_1 < n_2 < \cdots < n_m = n$ to signify that $w_1 = w_2 = \cdots = w_n = 1$ and $a_i = a'_j$
holds for all indices $i$ and $j$ with $n_{j-1} < i \le n_j$ and $1 \le j \le m$. Our algorithm for solving the
maximum-density problem for the $O(m)$-space representable sequence $S$ is shown in Figure 12.
algorithm SPARSE

1 for $k = 1$ to $m$ do
2     let $n'_k = n_k - n_{k-1}$;
3 let $S'$ be the length-$m$ sequence $(n'_1 a'_1, n'_1), (n'_2 a'_2, n'_2), \ldots, (n'_m a'_m, n'_m)$;
4 output $(n_{i'-1} + 1, n_{j'})$, where $(i', j')$ is an optimal output of GENERAL($w_{\min}$, $w_{\max}$, $S'$);
5 for $k = 1$ to $m$ do {
6     if $n_k \ge w_{\min}$ then
7         output $(\ell_{n_k}, n_k)$ and $(r_{n_k}, n_k)$;
8     if $n_{k-1} + w_{\min} \le n$ then
9         output $(n_{k-1} + 1, n_{k-1} + w_{\min})$ and $(n_{k-1} + 1, \min(n, n_{k-1} + w_{\max}))$;
10 }



Figure 12: Our algorithm that handles a sparse input sequence for the uniform case, where GENERAL
is defined in Figure 9.
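
Steps 1-4 of SPARSE rest on a simple correspondence between blocks of $S'$ and segments of $S$ whose endpoints fall on run boundaries. The sketch below illustrates only that correspondence, under the assumption that the compact representation is given as a list of `(value, end_index)` pairs; the names `block_sequence`, `expand`, and `block_segment_to_indices` are hypothetical, and the call to GENERAL itself is not reproduced here.

```python
from fractions import Fraction


def block_sequence(runs):
    """Steps 1-3 of SPARSE: turn runs (a'_k, n_k) into S' = (n'_k * a'_k, n'_k)."""
    out, prev_end = [], 0
    for value, end in runs:
        length = end - prev_end          # n'_k = n_k - n_{k-1}
        out.append((length * value, length))
        prev_end = end
    return out


def expand(runs):
    """Materialize the full uniform-width sequence a_1, ..., a_n (for checking only)."""
    seq, prev_end = [], 0
    for value, end in runs:
        seq.extend([value] * (end - prev_end))
        prev_end = end
    return seq


def block_segment_to_indices(runs, i_blk, j_blk):
    """Step 4 of SPARSE: map blocks i'..j' of S' to the segment (n_{i'-1}+1, n_{j'}) of S."""
    ends = [0] + [end for _, end in runs]
    return ends[i_blk - 1] + 1, ends[j_blk]


# A made-up example: three runs covering n = 6 positions.
runs = [(1, 3), (0, 5), (1, 6)]
blocks = block_sequence(runs)                   # [(3, 3), (0, 2), (1, 1)]
i, j = block_segment_to_indices(runs, 2, 3)     # blocks 2..3 cover positions 4..6
dense_blocks = Fraction(sum(a for a, _ in blocks[1:3]), sum(w for _, w in blocks[1:3]))
dense_full = Fraction(sum(expand(runs)[i - 1:j]), j - i + 1)
```

Because a block of $S'$ carries both the total value and the total width of its run, the density of any block range in $S'$ equals the density of the corresponding segment of $S$, so a width-aware solver on $S'$ covers every segment that starts and ends on run boundaries.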

Theorem 3 Algorithm SPARSE solves the maximum-density problem for the above $O(m)$-space
representable sequence in $O(m)$ time.

Proof. By Theorem 2, SPARSE runs in $O(m)$ time. Let $S(i^*, j^*)$ be a feasible segment with
maximum density. We first show that without loss of generality $i^* - 1 \in \{n_0, n_1, \ldots, n_{m-1}\}$
or $j^* \in \{n_1, n_2, \ldots, n_m\}$ holds. More specifically, we show that if $i^* - 1 \notin \{n_0, n_1, \ldots, n_{m-1}\}$
and $j^* \notin \{n_1, n_2, \ldots, n_m\}$, then $S(i^*+1, j^*+1)$ is also a feasible segment with maximum
density: By $i^* - 1 \notin \{n_0, n_1, \ldots, n_{m-1}\}$, we know $a_{i^*-1} = a_{i^*}$. By $j^* \notin \{n_1, n_2, \ldots, n_m\}$, we know
$a_{j^*} = a_{j^*+1}$. By the optimality of $S(i^*, j^*)$, we have $a_{i^*} \ge a_{j^*+1}$ and $a_{i^*-1} \le a_{j^*}$, implying
$a_{i^*-1} = a_{i^*} = a_{j^*} = a_{j^*+1}$. Therefore, $S(i^*+1, j^*+1)$ is also a maximum-density segment.

• Case 1: $i^* - 1 \in \{n_0, n_1, \ldots, n_{m-1}\}$ and $j^* \in \{n_1, n_2, \ldots, n_m\}$. Clearly, Steps 1-4 of SPARSE
take care of this case.

• Case 2: $i^* - 1 \notin \{n_0, n_1, \ldots, n_{m-1}\}$ and $j^* \in \{n_1, n_2, \ldots, n_m\}$. By Equation (1), we know that
$a_{i^*-1} = a_{i^*} \ne d(i^*, j^*)$ implies $d(i^*-1, j^*) > d(i^*, j^*)$ or $d(i^*+1, j^*) > d(i^*, j^*)$. If $(i^*, j^*)$ is not
discovered by Steps 6 and 7 of SPARSE, then $\ell_{j^*} < i^* < r_{j^*}$. Since $i^* - 1 \notin \{n_0, n_1, \ldots, n_{m-1}\}$
implies $a_{i^*-1} = a_{i^*}$, we know that $\ell_{j^*} < i^* < r_{j^*}$ implies $d(i^*-1, j^*) = d(i^*, j^*) = d(i^*+1, j^*)$.
Thus, $S(i^*-1, j^*)$ is also a feasible segment with maximum density. Clearly, we can continue
the same argument until having a maximum-density segment $S(i, j^*)$ such that either
$i - 1 \in \{n_0, n_1, \ldots, n_{m-1}\}$, which is handled in Case 1, or $i = \ell_{j^*}$, which is handled by Steps 6 and 7
of SPARSE.

• Case 3: $i^* - 1 \in \{n_0, n_1, \ldots, n_{m-1}\}$ and $j^* \notin \{n_1, n_2, \ldots, n_m\}$. By Equation (1), we know that
$a_{j^*} = a_{j^*+1} \ne d(i^*, j^*)$ implies $d(i^*, j^*-1) > d(i^*, j^*)$ or $d(i^*, j^*+1) > d(i^*, j^*)$. If $(i^*, j^*)$ is
not discovered by Steps 8 and 9 of SPARSE, then $\ell_{j^*} < i^* < r_{j^*}$. Since $j^* \notin \{n_1, n_2, \ldots, n_m\}$
implies $a_{j^*} = a_{j^*+1}$, we know that $\ell_{j^*} < i^* < r_{j^*}$ implies $d(i^*, j^*-1) = d(i^*, j^*) = d(i^*, j^*+1)$.
Thus, $S(i^*, j^*+1)$ is also a feasible segment with maximum density. Clearly, we can continue
the same argument until having a maximum-density segment $S(i^*, j)$ such that either
$j \in \{n_1, n_2, \ldots, n_m\}$, which is handled in Case 1, or $i^* = \ell_j$, which is handled by Steps 8 and 9 of
SPARSE.

The theorem is proved. □ 